Add spark-gluten-clickhouse entry (Spark + Gluten with the CH backend) by alexey-milovidov · Pull Request #861 · ClickHouse/ClickBench

alexey-milovidov · 2026-05-07T12:24:12Z

Summary

Adds a spark-gluten-clickhouse/ entry that runs Apache Spark with Apache Gluten configured to use the ClickHouse backend (spark.gluten.sql.columnar.backend.lib=ch). Gluten loads libch.so (a fork of ClickHouse v23.1) into the Spark executor JVM and runs the columnar physical plan natively through it.
Complements spark-gluten/ (Velox backend) and the proposed spark-velox/ (#858, "Add spark-velox entry (Spark + Velox via Apache Gluten)"). This entry exercises a meaningfully different execution path: Catalyst → Substrait → ClickHouse engine, rather than Catalyst → Substrait → Velox.
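As a rough illustration of what selecting the CH backend looks like on the Spark side, the sketch below assembles a settings dictionary. Only spark.gluten.sql.columnar.backend.lib=ch and the 50/50 heap/off-heap split come from this PR. The plugin class name and the spark.gluten.sql.columnar.libpath key are assumptions based on Gluten's documentation and should be checked against the Gluten version actually built. The function name and the memory budget parameter are hypothetical.

```python
def gluten_ch_conf(total_gib: int, libch_path: str) -> dict[str, str]:
    """Build Spark settings for Gluten with the ClickHouse backend.

    total_gib is the memory budget handed to Spark; it is split evenly
    between the JVM heap and the off-heap pool, as the PR describes.
    """
    heap_gib = total_gib // 2
    offheap_gib = total_gib - heap_gib
    return {
        # Assumption: plugin class name used by recent Apache Gluten releases.
        "spark.plugins": "org.apache.gluten.GlutenPlugin",
        # From the PR: select the ClickHouse backend.
        "spark.gluten.sql.columnar.backend.lib": "ch",
        # Assumption: key that tells Gluten where libch.so lives.
        "spark.gluten.sql.columnar.libpath": libch_path,
        # Standard Spark memory settings.
        "spark.driver.memory": f"{heap_gib}g",
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{offheap_gib}g",
    }


conf = gluten_ch_conf(48, "/opt/gluten/libch.so")  # hypothetical path
```

Each key/value pair could then be passed to SparkSession.builder.config(key, value) before the session is created. The actual query script in the entry may set additional options.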

Build

No pre-built bundle is published for the CH backend (Apache Gluten v1.4.0 ships only the Velox bundle, and Maven Central has nothing). benchmark.sh therefore builds two things from source:

libch.so — built from Kyligence/ClickHouse at the branch pinned in gluten/cpp-ch/clickhouse.version (currently rebase_ch/20250326). Uses Clang 18 / cmake / ninja.
The Gluten Spark plugin — built via Maven with -Pbackends-clickhouse,spark-3.5,scala-2.12 under JDK 8.
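To show how a build script might pick up the pinned branch rather than hard-coding it, here is a minimal sketch that parses a KEY=VALUE file. It assumes gluten/cpp-ch/clickhouse.version uses that layout. The key names CH_ORG and CH_BRANCH in the sample are assumptions, and the helper names are hypothetical. The real benchmark.sh may do this differently.

```python
def parse_version_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pins[key.strip()] = value.strip()
    return pins


def clone_command(pins: dict[str, str]) -> list[str]:
    """Build a git clone command for the pinned ClickHouse fork branch."""
    url = f"https://github.com/{pins['CH_ORG']}/ClickHouse.git"
    return ["git", "clone", "--branch", pins["CH_BRANCH"], url]


# Sample content; the key names are assumed, the branch is the one named in the PR.
sample = "CH_ORG=Kyligence\nCH_BRANCH=rebase_ch/20250326\n"
pins = parse_version_file(sample)
cmd = clone_command(pins)
```

Reading the pin from the file keeps the script in step with whatever branch the checked-out Gluten revision expects. Submodule handling is left out of this sketch.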

Limitations

The libch.so compile is essentially a ClickHouse build and is RAM-hungry; Gluten's docs recommend ≥64 GB. On c6a.4xlarge (32 GB) it may OOM — c6a.8xlarge or larger is recommended for a clean run, hence the default machine label in benchmark.sh.
ARM is untested. Both ClickHouse and the Gluten plugin should compile on aarch64 in principle, but the Gluten CI does not publish CH-backend artifacts for ARM.
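Because the libch.so build can run out of memory on smaller hosts, a preflight check before starting the compile may save a long failed run. The following is a sketch only: the function names and the 10% slack are hypothetical choices, and nothing like it is claimed to exist in benchmark.sh. The slack is there because an instance sold as 64 GB typically reports somewhat less to the operating system.

```python
import os

GIB = 1024 ** 3


def physical_memory_bytes() -> int:
    """Total physical memory as reported by the OS (Linux and most Unixes)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def enough_ram_for_build(mem_bytes: int, required_gib: int = 64,
                         slack: float = 0.9) -> bool:
    """Return True if mem_bytes is at least slack * required_gib GiB."""
    return mem_bytes >= required_gib * GIB * slack
```

A script could call enough_ram_for_build(physical_memory_bytes()) and abort with a clear message when it returns False, instead of letting the compiler be killed partway through.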

Notes

Queries use ClickHouse-style regex backreferences (\1) rather than Spark's $1, because regex evaluation runs inside libch.so. This was anticipated in the existing spark-gluten/README.md and Gluten issue #7545.
Memory split between Spark heap and the Gluten off-heap pool is 50/50, identical to the Velox entry — the CH backend also runs off-heap via JNI.
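The backreference difference mentioned above can be illustrated with a small converter from Spark's $N replacement syntax to the \N form that ClickHouse expects. This is a hypothetical helper for illustration; the PR's query file presumably contains the \1 form directly. It handles single-digit groups only and leaves a backslash-escaped dollar sign untouched.

```python
import re

# Match $N (single digit) unless the dollar sign is escaped with a backslash.
_DOLLAR_GROUP = re.compile(r"(?<!\\)\$(\d)")


def to_clickhouse_backrefs(replacement: str) -> str:
    """Rewrite Spark/Java-style $N group references as ClickHouse-style \\N."""
    return _DOLLAR_GROUP.sub(lambda m: "\\" + m.group(1), replacement)


converted = to_clickhouse_backrefs("$1")
```

Only the replacement argument of a regex-replace call should be passed through such a helper, since $ inside the pattern itself is an end-of-line anchor and must not be rewritten.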

Test plan

Run on an x86_64 c6a.8xlarge.
Verify benchmark.sh clones gluten + Kyligence/ClickHouse, builds libch.so and the Spark plugin, runs all 43 queries, and writes results/<machine>.json.
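The last verification step can be partly automated. The sketch below checks a results document for 43 queries, assuming the layout used by other ClickBench entries: a top-level "result" array with one list of three timings per query, and null for a failed run. The function name and the layout are assumptions, not something this PR specifies.

```python
import json


def check_results(doc: dict, expected_queries: int = 43,
                  runs_per_query: int = 3) -> list[str]:
    """Return a list of human-readable problems; empty means the file looks sane."""
    result = doc.get("result")
    if not isinstance(result, list):
        return ["missing 'result' array"]
    problems = []
    if len(result) != expected_queries:
        problems.append(f"expected {expected_queries} queries, found {len(result)}")
    for i, runs in enumerate(result, start=1):
        if not isinstance(runs, list) or len(runs) != runs_per_query:
            problems.append(f"query {i}: expected {runs_per_query} timings")
        elif any(t is None for t in runs):
            problems.append(f"query {i}: at least one run failed (null timing)")
    return problems


# Synthetic document standing in for results/<machine>.json.
sample = json.loads(json.dumps({"result": [[0.3, 0.2, 0.2]] * 43}))
problems = check_results(sample)
```

Running this against the generated file after benchmark.sh finishes gives a quick signal that every query produced timings, without inspecting the JSON by hand.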

🤖 Generated with Claude Code

Adds a spark-gluten-clickhouse/ entry that runs the ClickBench query suite against Apache Spark with Apache Gluten configured to use the ClickHouse backend ('ch'), in which Gluten loads libch.so (a fork of ClickHouse v23.1) into the Spark executor JVM and runs the columnar plan natively through it. Compared with spark-gluten/ (which uses the Velox backend), this exercises a meaningfully different execution path: Catalyst -> Substrait -> ClickHouse engine, rather than Catalyst -> Substrait -> Velox.

No pre-built bundle is published for the CH backend (the Apache Gluten release tarball ships only the Velox bundle), so benchmark.sh builds both libch.so and the Gluten Spark plugin from source. The build is memory-hungry; a 64 GB host (c6a.8xlarge or larger) is recommended. Queries use ClickHouse-style regex backreferences (\1) since the regex evaluation runs inside libch.so, as anticipated in the spark-gluten/ README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alexey-milovidov wants to merge 1 commit into main from add-spark-gluten-clickhouse
